WO01090915 






(12) INTERNATIONAL APPLICATION PUBLISHED UNDER THE PATENT COOPERATION TREATY (PCT) 



(43) International Publication Date: 29 November 2001 (29.11.2001)
(10) International Publication Number: WO 01/90915 A2



(51) International Patent Classification:

(21) International Application Number: PCT/CA01/00712

(22) International Filing Date: 22 May 2001 (22.05.2001)

(25) Filing Language: English

(26) Publication Language: English

(30) Priority Data:
09/576,871 22 May 2000 (22.05.2000) US

(71) Applicant (for all designated States except US):
GAZELLE TECHNOLOGY CORPORATION [CA/CA]; 250-3665 Kingsway, Vancouver,
British Columbia V5R 5W2 (CA).

(72) Inventors; and
(75) Inventors/Applicants (for US only): WILSON, Jeremy [CA/CA]; 11045
Southridge Rd, North Delta, British Columbia V4E 2M3 (CA). STRAYER, Jayson, D.
[CA/CA]; 128 - 3280 East 58th Avenue, Vancouver, British Columbia V5S 3T2 (CA).

(74) Agent: MANNING, Gavin, N.; Oyen Wiggs Green & Mutala, 480 - 601 West
Cordova Street, Vancouver, British Columbia V6B 1G1 (CA).

(81) Designated States (national): AE, AG, AL, AM, AT, AT (utility model), AU,
AZ, BA, BB, BG, BR, BY, BZ, CA, CH, CN, CR, CU, CZ, CZ (utility model), DE, DE
(utility model), DK, DK (utility model), DM, DZ, EE, EE (utility model), ES, FI,
FI (utility model), GB, GD, GE, GH, GM, HR, HU, ID, IL, IN, IS, JP, KE, KG, KP,
KR, KZ, LC, LK, LR, LS, LT, LU, LV, MA, MD, MG, MK, MN, MW, MX, MZ, NO, NZ, PL,
PT, RO, RU, SD, SE, SG, SI, SK, SK (utility model), SL, TJ, TM, TR, TT, TZ, UA,
UG, US, UZ, VN, YU, ZA, ZW.

(84) Designated States (regional): ARIPO patent (GH, GM, KE, LS, MW, MZ, SD, SL,
SZ, TZ, UG, ZW), Eurasian patent (AM, AZ, BY, KG, KZ, MD, RU, TJ, TM), European
patent (AT, BE, CH, CY, DE, DK, ES, FI, FR, GB, GR, IE, IT, LU, MC, NL, PT, SE,
TR), OAPI patent (BF, BJ, CF, CG, CI, CM, GA, GN, GW, ML, MR, NE, SN, TD, TG).



(54) Title: PROCESSOR ARRAY AND PARALLEL DATA PROCESSING METHODS

(57) Abstract: An array of processor elements has multiple instruction streams and
multiple data streams broadcast to all of the processor elements. The processor
elements are each connected to multiple neighbouring processor elements within a
cruciate neighbourhood. The architecture is suitable for use in fine-grained
applications. The array may have a processor element for each pixel of an image.
The array is preferably provided on a single integrated circuit having 10,000 or
more processor elements.



Published:
- without international search report and to be republished upon receipt of that
  report

For two-letter codes and other abbreviations, refer to the "Guidance Notes on
Codes and Abbreviations" appearing at the beginning of each regular issue of the
PCT Gazette.






PROCESSOR ARRAY AND PARALLEL DATA PROCESSING METHODS

Technical Field

This invention relates to computers. In particular the invention relates to
massively parallel computers having processor arrays and methods for using arrays
of processors to solve problems. Specific embodiments of the invention are
particularly useful for image processing.

Background

Image processing is both computationally intensive and data intensive. By
way of example, using an MPEG ("Motion Picture Experts Group") image
compression algorithm to compress a 20 Megabytes-per-second television signal in
real time may require on the order of 200 [illegible] arithmetic operations per
second. The goal of providing cost effective computer systems capable of providing
the extremely high throughput required for image processing and similar tasks has
so far eluded the computer industry.

One way to achieve higher throughput in computer image processing
systems is to use a higher speed processor. The processor could be any of several
types commonly in use, such as RISC (reduced instruction set computer), CISC
(complex instruction set computer), DSP (digital signal processor), or VLIW (very
long instruction word). A basic problem with applying a high speed processor to
data intensive applications such as image processing is that the processor typically
spends a significant amount of time moving data to and from the memory. Further,
when a single processor is used, the inherently parallel nature of many image
processing algorithms must be broken down by the programmer into a serial
program which works with one or at most a few pixels at a time.

Another common approach to achieving real-time performance in difficult
image processing applications is to build custom hardware to perform the image
processing. To do so, a problem is typically broken down into its main functional
steps, and each step is implemented by different hardware subsystems. The
hardware may be provided on an application specific integrated circuit (ASIC) or
the like. Such hardware-based solutions do not typically scale up very well to larger
image sizes, nor are they readily applicable to other problems.

A further way to achieve higher throughput is to divide the image processing
task between many processor elements (PEs). For inherently two-dimensional (2D)
problems, such as image processing, which deal with 2-dimensional arrays of data
elements, such as pixels, it is natural to arrange a number of processing elements so
that each processing element is logically arranged at a node of a 2-dimensional grid.
Local connections are provided between neighbouring processors. A natural way to
implement many 2D problems is to assign a single processor element to each data
element. That is, to provide processor elements arranged at nodes of a mesh which
has the same dimensions as the array of data elements that it manipulates. There are
many examples of the use of computer processor arrays for solving image
processing and other computational problems.

An architecture that assigns only a few data elements per processor element
is termed "fine-grained". In contrast, a coarse grained architecture has many data
elements assigned to each processor element. M. J. Flynn, "Very High Speed
Computing Systems", Proceedings of the IEEE, Vol. 54, No. 12, pp. 1901-1909
(1966) categorized parallel processing computing systems into three categories:
SIMD (single instruction stream, multiple data streams), MIMD (multiple
instruction streams, multiple data streams) and MISD (multiple instruction streams,
single data stream). In a SIMD system, the same instruction is broadcast to all
processor elements. Each processor element has its own set of registers along with
some means for it to receive unique data (such as a data value for a particular pixel
in an image). In SIMD systems each individual processor element can be simple
because it does not require a separate program counter or logic for fetching
instructions from memory. Consequently, SIMD arrays can be well suited for
fine-grained architecture.

In MIMD architectures every processor element has its own program store
and can operate independently of other processor elements. A MIMD processor
array may also be termed a "multi-computer", because each processor element is a
full computer in its own right. MIMD architectures are not as well suited to
fine-grained problems such as image processing because each processor element in a
MIMD array is more complicated than, and requires larger circuits than, its
counterpart in a SIMD array. Further, inter-processor contention for shared
resources is an issue because the processor elements in a MIMD array operate
independently.

In MISD architectures a single stream of data is passed along a chain of
processors with a different operation performed at each step in the chain. Systems
which implement MISD architectures are more commonly referred to as systolic
arrays, and are well suited to signal processing and video scan line processing, but
not well suited to problems such as image compression that require two-dimensional
operations.

In a SIMD array it is difficult to implement algorithms where one group of
processor elements is required to operate differently from another group of
processor elements. In some SIMD architectures individual processor elements can
conditionally skip instructions (SIMD architectures without this capability can
achieve the effect of conditional statements through more complicated mathematical
expressions).

Models for studying and modelling parallel computing have been proposed
in which there are multiple instruction streams each of which is provided to a
specific set of processing elements and multiple data streams. Such models are
termed MSIMD models. Typically each instruction stream is associated with a
specific data stream.

A key problem with using any parallel array of processors is to program the
processors in the array in such a way that the parallelism is well utilized (i.e. so that
a good proportion of the processors are kept busy most of the time). As a simple
example, consider the following conditional branch structure, coded in the C
programming language. Such a conditional sequence might occur where the
behaviour of some processor elements (e.g. processor elements processing pixels
which are located at the boundary of an image) needs to be different from all other
processor elements.

if (r0 == 0)
{
    /* Sequence A for non-boundary pixels */
}
else
{
    /* Sequence B for boundary pixels */
}

In this example, r0 is the symbolic name for a register in each processor
element. The processor element executes either sequence A or sequence B
depending on the state of its r0 register. It can be appreciated that if sequence A and
sequence B are equally long then each processor element will be utilized only 50%
of the time because it will have to skip one or other of the conditional branches.
The processor elements all receive the same instruction stream. While a processor
element is skipping instructions it is not performing useful work.
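
The 50% figure can be checked with a small calculation. The following is a minimal sketch, assuming that both sequences pass through one shared instruction stream and that a skipped instruction costs the same time as an executed one; the function name and its parameters are illustrative and do not come from the specification.

```c
#include <assert.h>

/* Fraction of time a processor element does useful work when every element
   receives one shared instruction stream that holds sequence A followed by
   sequence B, and each element skips the branch that does not apply to it.
   len_a and len_b are instruction counts (illustrative parameters). */
double simd_branch_utilization(int len_a, int len_b, int takes_a)
{
    assert(len_a > 0 && len_b > 0);
    int useful = takes_a ? len_a : len_b;
    return (double)useful / (double)(len_a + len_b);
}
```

With equal lengths the result is 0.5 for every element, whichever branch it takes. When the lengths differ, elements that take the shorter branch fare worse.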

A table lookup operation is another example of inefficient utilization of a
parallel array. Consider a table lookup operation wherein each processor element is
required to retrieve an element from a table based on the contents of a register.
Table lookup operations of this type are used commonly, for example, to implement
such tasks as colour correction, contrast enhancement, or texture mapping.
Typically the table is much larger than the memory available at each processor
element. Even if there were sufficient data storage at each processor element it
would be a poor use of memory resources to have a copy of the same table in the
memory of every processor element. Since each processor element requires access
to a specific element of the table, either the table will be stored in an external
memory or the entire table must be broadcast to every processor element. If the table
is stored in an external memory then there will be contention problems caused by a
large number of processor elements attempting simultaneously to access the table. If
the table is broadcast to all of the processor elements then each processor element
waits until the appropriate table value is broadcast, and stores only this value. It
ignores all other values. It can be appreciated that processor utilization is very low
during such look-up operations. Even if the contents of a table are broadcast to
processor elements in a number of data streams each processing element must do
significant work to obtain the one value from the table that it requires. This
increases power consumption of the processor array.

An important characteristic of massively parallel architectures is the way in
which processor elements are interconnected with one another. Various
interconnection schemes are known. For example, U.S. patent No. 4,314,349
discloses a typical architecture wherein each processor element is connected to its
immediate neighbours to the "north", "south", "east", and "west". A problem
with such limited connectivity is that any translation operation (combination of
horizontal and vertical shifts) can only be implemented a single processor
element step at a time. This is especially a problem for any algorithm that needs to
compute a single result that involves all data elements, such as determining the
maximum pixel value in an image. In a "four connected neighbourhood"
architecture as exemplified by U.S. patent No. 4,314,349, it takes at least R x C
operations to obtain such a value, where R is the number of rows in the processor
array and C is the number of columns in the processor array. The overall result is
that individual processor elements spend a lot of time idle while values propagate
through the rest of the array. A further problem with such limited connectivity is
that the array cannot readily process volumetric (three dimensional) image data
because the PEs cannot be reconfigured into a mesh representing a three
dimensional structure.

It is also known to connect processor elements at a border of an array to
corresponding processor elements on the opposite border. U.S. patent No.
5,590,356 discloses an example of such a "torus" architecture. While improving the
efficiency of certain image operations, a torus architecture still does not help the
global evaluation problem, and it introduces long wiring paths (from one edge of
the array to another) that impose lower limits on the data transfer rate between
processor elements because of the propagation delays along these long paths.

Some architectures have a much higher degree of connectivity. For
example, U.S. patent No. 4,805,091 describes an array of processor elements
logically arranged at nodes of a many-dimensional hyper-cube and a message
routing system which permits each processor element to pass packets of data to
another processor element with few intervening steps. While it can achieve more
efficient processor utilization than the architectures described above, this type of
architecture is difficult to implement in a monolithic array. Long path propagation
delays adversely affect the scalability of the system.

Large arrays of processors can often be made fault tolerant so that, if one or
more processors are defective, their functions can be assumed by spare processors.
There are a number of examples of fault tolerant processor arrays in the academic
and patent literature including those disclosed in U.S. patent Nos. 4,314,349;
5,625,836; 5,590,356; 5,748,872; 5,956,274; and, 4,722,084. Fault tolerance in
memory arrays (e.g. as described by patents US6032264, and US5920515) has
proven very beneficial to reducing their price because fault tolerance greatly
increases the yield of operational chips. This is especially important because
memories are typically very high density, and so especially sensitive to defects. It is
much more difficult to provide a fault tolerant processor array than it is to provide a
fault tolerant memory array because the cells in a memory array do not need to
communicate with each other as do the processors in a processor array. So if a
defect in a memory array is avoided by replacing an entire row or column, it is not
necessary for the replacement row or column to be located physically adjacent to the
defect. However, in a processor array, any fault correction scheme must replace
the defective cell in such a way that all the local intra-connections are implemented.

There is a need for cost effective computer systems capable of efficiently
handling multi-dimensional problems, such as image processing. There is a
particular need for such systems capable of handling streams of data, such as video
image data in real time. There is a particular need for such systems which are
scalable through a wide range of array sizes with a minimum of software or
hardware changes.




Summary of the Invention

This invention provides arrays of processor elements which have advantages
over the prior art. One aspect of the invention provides a processor array
comprising a plurality of interconnected processor elements, a plurality of
instruction buses connected to each of the processor elements, at least one data bus
connected to each of the processor elements and an instruction selection switch
associated with each of the processor elements. Different processors in the array can
be performing instructions in different instruction streams. Each processor element
is connected to execute instructions from one of the plurality of instruction buses as
selected by its instruction selection switch.

In preferred embodiments each of the processing elements comprises an
instruction bus selection register and the instruction selection switch is constructed
to select a one of the plurality of instruction buses corresponding to a data value in
the instruction bus selection register. The contents of the instruction bus selection
register can be changed under software control.
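
The register-controlled switch can be modelled in a few lines. This is only an illustrative sketch: the bus count of 4, the 16-bit instruction word and all identifiers are assumptions made for the example, not details taken from this summary.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_INSTRUCTION_BUSES 4  /* assumed count for this sketch */

typedef struct {
    uint8_t instr_bus_select;  /* models the instruction bus selection register */
} ProcessorElement;

/* Models the instruction selection switch: of the words present on all
   buses in this cycle, return the one on the bus named by the register. */
uint16_t select_instruction(const ProcessorElement *pe,
                            const uint16_t buses[NUM_INSTRUCTION_BUSES])
{
    assert(pe->instr_bus_select < NUM_INSTRUCTION_BUSES);
    return buses[pe->instr_bus_select];
}

/* Changing the register under software control redirects the element. */
void switch_instruction_bus(ProcessorElement *pe, uint8_t bus)
{
    assert(bus < NUM_INSTRUCTION_BUSES);
    pe->instr_bus_select = bus;
}
```

Every element sees the same set of buses. Only the value held in its own register decides which instruction it executes, so different elements can follow different streams in the same cycle.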

Most preferably the array comprises a plurality of data buses connected to
each of the processor elements. A data selection switch associated with each of the
processor elements can be used to select one of the data buses. Each processor
element can be connected to receive data from a one of the plurality of data buses
selected by its data selection switch. The data buses are not necessarily associated
with any particular instruction stream.

In preferred embodiments, each of the processor elements is
connected to send data to and receive data from other processor elements in a
cruciate neighbourhood.

Another aspect of the invention provides a processor array comprising a
plurality of interconnected processor elements. Each of the processor elements is
logically arranged at an intersection of a row and a column in a grid comprising a
plurality of rows and a plurality of columns. Each of the processor elements is
connected to transmit data to a plurality of neighbouring processor elements. The
plurality of neighbouring processor elements comprises a number N > 1 of
processor elements in the column on either side of the processor element and a
number M > 1 of processor elements in the row on either side of the processor
element. In some embodiments N > 4 and M > 4. There may be different
numbers of neighbouring processor elements on either side of a processor element.
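
The shape of such a neighbourhood can be expressed as a simple membership test. The sketch below assumes the symmetric case, with the same reach on both sides of the element; the function names are illustrative.

```c
#include <assert.h>
#include <stdlib.h>

/* Returns 1 if the element at offset (d_row, d_col) from a given element
   lies in its cruciate (cross-shaped) neighbourhood: up to n elements away
   in the same column, or up to m elements away in the same row. */
int in_cruciate_neighbourhood(int d_row, int d_col, int n, int m)
{
    assert(n > 0 && m > 0);
    if (d_row == 0 && d_col == 0)
        return 0;                      /* the element itself */
    if (d_col == 0)
        return abs(d_row) <= n;        /* same column */
    if (d_row == 0)
        return abs(d_col) <= m;        /* same row */
    return 0;                          /* diagonal offsets are not connected */
}

/* Neighbours of an interior element: n above, n below, m left, m right. */
int cruciate_neighbour_count(int n, int m)
{
    return 2 * n + 2 * m;
}
```

For example, a reach of 4 in each direction gives an interior element 16 neighbours, none of them on a diagonal.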

A further aspect of the invention provides a method for operating a
processor array comprising a plurality of processor elements. Each of the processor
elements has a plurality of registers which require periodic refreshing at a refresh
frequency. The method comprises providing one or more streams of instructions to
each of the processor elements for execution by the processor elements and,
periodically inserting into the one or more instruction streams register refresh
instructions, the register refresh instructions causing the processor elements to
rewrite data values in the registers. Preferably the processor element is left in the
same state after execution of a refresh instruction as it was before execution of the
refresh instruction. This permits refresh instructions to be inserted at any time, as
required.

A still further aspect of the invention provides a method for operating a
processor array having a plurality of interconnected processor elements. The method
comprises providing an array of processor elements, each of the processor elements
logically arranged at an intersection of a row and a column in a grid comprising a
plurality of rows and a plurality of columns. Each of the processor elements is
connected to transmit data to a plurality of neighbouring processor elements, the
plurality of neighbouring processor elements comprising a number N of processor
elements in the column on either side of the processor element and a number M of
processor elements in the row on either side of the processor element. The method
continues by determining when one or more of the processor elements is defective;
and, for each defective one of the processor elements, ignoring either the row or
column containing the defective one of the processor elements. The shape of the
neighbourhoods permits rows and/or columns to be ignored while preserving the
functionality of the processor array.
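
One way to picture ignoring rows is as a map from logical rows to the physical rows that remain in use. The sketch below is an illustration under stated assumptions, not the circuit described later in the specification: it assumes each physical row carries a defect flag, and it adds a check (not stated in the source) that two consecutive usable rows are no more than n physical rows apart, so that a neighbourhood reaching n rows can still link them. All identifiers are hypothetical.

```c
#include <assert.h>

/* Builds a map from logical rows to physical rows that skips every physical
   row flagged as containing a defective element. Returns the number of
   logical rows, or -1 if some pair of consecutive usable rows lies more
   than n physical rows apart, in which case a neighbourhood reaching n
   rows could no longer link them. */
int map_rows_skipping_defects(const int row_defective[], int physical_rows,
                              int n, int logical_to_physical[])
{
    assert(physical_rows > 0 && n > 0);
    int logical = 0;
    int previous = -1;                 /* last usable physical row seen */
    for (int r = 0; r < physical_rows; r++) {
        if (row_defective[r])
            continue;                  /* ignore the whole row */
        if (previous >= 0 && r - previous > n)
            return -1;                 /* gap wider than the reach */
        logical_to_physical[logical++] = r;
        previous = r;
    }
    return logical;
}
```

The same routine applies to columns by passing column flags and the row reach m in place of n.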

A still further aspect of the invention provides a method for implementing a
table lookup operation in a processor array. The method comprises: providing a
processor array comprising a plurality of processor elements; providing multiple
data streams to each processor element; providing a lookup table comprising several
parts, each part corresponding to a range of values, each of the parts comprising one
or more table values; simultaneously transmitting the several parts of the lookup
table on the multiple data streams; at each processor element selecting a data stream
to access as a function of a data value in the processor element; and, at each
processor element retrieving from the selected data stream a table value
corresponding to the data value of the processor element.
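
The steps of this method can be sketched as a simulation of a single processor element. The sketch assumes a 256-entry table of 8-bit values split into 4 equal parts, with each part covering a contiguous range of values; these sizes, the equal split and every identifier are assumptions made for illustration.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_STREAMS 4   /* assumed number of data streams */
#define PART_SIZE   64  /* assumed table entries carried per stream */

/* Data stream a processor element would select for a given data value:
   stream k carries the part covering values k*PART_SIZE up to
   (k+1)*PART_SIZE - 1. */
int stream_for_value(uint8_t value)
{
    return value / PART_SIZE;
}

/* Simulates one element's lookup. All streams transmit their parts at the
   same time, one entry per step; the element watches only its selected
   stream and keeps the entry broadcast at step value % PART_SIZE. */
uint8_t lookup_from_streams(const uint8_t table[NUM_STREAMS * PART_SIZE],
                            uint8_t value, int *steps_waited)
{
    int stream = stream_for_value(value);
    assert(stream < NUM_STREAMS);
    uint8_t result = 0;
    for (int step = 0; step < PART_SIZE; step++) {
        uint8_t on_selected_stream = table[stream * PART_SIZE + step];
        if (step == value % PART_SIZE) {
            result = on_selected_stream;   /* the one value this element needs */
            *steps_waited = step + 1;
        }
    }
    return result;
}
```

Under these assumptions the whole table passes in PART_SIZE steps instead of NUM_STREAMS * PART_SIZE steps, because the parts are transmitted side by side rather than one after another.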

Further features and advantages of the invention are described below.



Brief Description of the Drawings

In figures which illustrate non-limiting embodiments of the invention:

Figure 1 is a schematic view of a system including a processor array
according to the invention;

Figure 2 is a block diagram of a processor element for use in the invention;

Figure 3 illustrates the local connectivity of a processor element in a
processor array according to a preferred embodiment of the invention;

Figure 4 illustrates an alteration in the local connectivity of the processor
array of Figure 3 to accommodate a defective processor element;

Figure 5 is a partial schematic block diagram illustrating the connection of
read and write edge registers to processor elements in a column of a processor array
according to a specific embodiment of the invention;

Figure 6 is a simplified schematic diagram illustrating a possible
construction for a neighbour access logic circuit for use in a processor element;

Figure 7 is a schematic diagram illustrating a possible construction for
removing defective processor elements from operation; and,

Figure 8 is a schematic diagram for a possible defect logic circuit.

Description

I, Overview 

Figure 1 is a schematic view which illustrates the overall structure of a
system 10, according to a currently preferred embodiment of the invention. System
10 includes a processor array 11. Array 11 is preferably constructed on a single
integrated circuit. Array 11 comprises a large number of processor elements 12
arranged in a 2-dimensional topology. Each processor element 12 is logically
arranged at an intersection of a row 15A and a column 15B in a grid comprising a
plurality of rows and a plurality of columns. A typical array 11 for image
processing applications could have in excess of 10,000 processor elements 12. A
processor array according to the invention might, for example, have 19,200
processor elements 12 logically arranged in 160 rows and 120 columns. A processor
array according to the invention could also comprise a long narrow array. For
example, the array could have a number of columns equal to or slightly greater than
the number of pixels in a row of an image to be processed and a few rows, for
example 8 to 16 rows. Such an array might, for example, have 5760 processor
elements arranged in 720 columns and 8 rows. Preferably all of the processor
elements 12 of array 11 are fabricated on a single semiconductor wafer. Control
signals such as system timing signals from a system clock (not shown) are provided
to processor elements 12 by way of a control bus 29 (Figure 2).

As is typical in SIMD architectures, instructions and data values are
broadcast to every processor element 12. However, in contrast to previous SIMD
architectures, array 11 provides multiple instruction streams 14 and multiple data
streams 16 which are simultaneously broadcast to every processor element 12. In a
currently preferred embodiment of the invention there are 16 broadcast streams,
indicated generally by the reference numeral 17, each of which may be used either
as an instruction stream 14 or as a data stream 16. Each broadcast stream is carried
by a suitable bus. In this specification the term "bus" has the broad meaning "a
signal route along which data signals can be passed".

The operation of array 11 is coordinated by a controller 18. Controller 18
may comprise, for example, a conventional CPU (which could be a RISC, CISC,
DSP, or VLIW architecture) running software instructions stored in a memory 19.
Controller 18 manages array 11 by causing appropriate broadcast streams 17 to be
delivered to processor elements 12 from an array program and data memory 20 and
coordinating direct memory access (DMA) operations of DMA controller 26 as
described below. Controller 18 could be integrated on a single chip with processor
elements 12 or could exist off-chip as a separate component. For video processing
applications system 10 preferably includes a video decoder 33 and a video encoder
36.

As described below, the incorporation of multiple broadcast streams 17
which can be configured to provide multiple data streams and multiple instruction
streams makes it possible to perform certain operations, such as table look ups, very
efficiently. Furthermore, the architecture of system 10 can be operated in certain
circumstances to provide reduced power consumption as compared to prior
architectures.

Preferred Construction of Processor Elements

As shown in Figure 2, each processor element 12 has a set of registers
indicated generally by 21. Some of registers 21 are general purpose registers 21A
which processor element 12 can use for storing data and the results of computations.
Other ones of registers 21 are control registers 21B which have special purposes.
Each processor element 12 has an instruction select register 22 (Fig. 2). The
contents of instruction select register 22 control which one of broadcast streams 17
processor element 12 will look to for instructions to be executed on the processor
element 12. In the illustrated embodiment, an instruction stream select switch
controlled by the value stored in register 22 selects instructions from one instruction
stream 14 and delivers the selected instructions to processor element 12. Each
processor element 12 also has a data select register 23. The contents of data select
register 23 control which one of broadcast streams 17 will be looked to by the
processor element for data. In the illustrated embodiment, a data stream select
switch 38, which is controlled by a value stored in register 23, can select data from
one data stream 16 and make data from the selected data stream available to
processor element 12.



Table I lists a possible complement of registers for a processor element 12
having 128 possible register addresses.

TABLE I

ADDRESS    DESCRIPTION
0-7        special purpose registers
8-15       control, status, instruction stream selection, data stream
           selection etc.
16-31      general purpose registers
32-63      read only data streams (accessed in the same manner as
           data in registers)
64-127     read only data from neighbouring processor elements
           (accessed in the same manner as data in registers)
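
The address ranges of Table I translate directly into a decode routine. The following is a minimal sketch; the enum and function names are invented for the example, and only the ranges and the read-only marking come from the table.

```c
#include <assert.h>

typedef enum {
    REG_SPECIAL,        /* addresses 0-7 */
    REG_CONTROL,        /* addresses 8-15 */
    REG_GENERAL,        /* addresses 16-31 */
    REG_DATA_STREAM,    /* addresses 32-63, read only */
    REG_NEIGHBOUR       /* addresses 64-127, read only */
} RegisterClass;

/* Maps a 7-bit register address to its class in Table I. */
RegisterClass classify_register(int address)
{
    assert(address >= 0 && address < 128);
    if (address < 8)  return REG_SPECIAL;
    if (address < 16) return REG_CONTROL;
    if (address < 32) return REG_GENERAL;
    if (address < 64) return REG_DATA_STREAM;
    return REG_NEIGHBOUR;
}

/* Returns 1 for the ranges that Table I marks as read only. */
int register_marked_read_only(int address)
{
    return classify_register(address) >= REG_DATA_STREAM;
}
```

Because data streams and neighbour data occupy ordinary register addresses, an instruction can name them as operands in the same way as a local register.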



Registers 22 and 23 can be modified by processor element 12 under program
control. The instruction set for processor elements 12 includes instructions that
cause the processor element 12 to switch to a different instruction stream or to
switch to a different data stream. Switching to a different instruction stream can be
used, as described below, to achieve a function similar to that of a "jump"
instruction in a conventional serial processor. Switching to a particular data stream
can be used to enhance table look ups.
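
The effect of a stream switch on a two-way branch can be estimated with a short calculation. The sketch below assumes that the two branch sequences are broadcast at the same time on two instruction streams and that a switch costs a fixed number of cycles; the function names and the switch_cost parameter are assumptions, not figures from the specification.

```c
#include <assert.h>

/* Cycles for all elements to finish an if/else when both branch sequences
   must pass through one shared instruction stream. */
int cycles_single_stream(int len_a, int len_b)
{
    assert(len_a >= 0 && len_b >= 0);
    return len_a + len_b;
}

/* Cycles when sequence A and sequence B are broadcast simultaneously on two
   streams and each element switches to the stream holding its branch.
   switch_cost is an assumed cost for the switch instruction. */
int cycles_two_streams(int len_a, int len_b, int switch_cost)
{
    assert(len_a >= 0 && len_b >= 0 && switch_cost >= 0);
    int longest = len_a > len_b ? len_a : len_b;
    return switch_cost + longest;
}
```

For two 10-instruction branches and a one-cycle switch, the estimate falls from 20 cycles to 11, and no element spends time skipping instructions.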

The specific implementation of a processor element 12 shown in Figure 2
has 16 general purpose registers, each 16-bits wide. Each general purpose register
can conveniently store a colour pixel value, two 8-bit pixels, or the result of an 8-bit
by 8-bit multiplication. Instructions are also typically 16-bits wide. The processor
elements are preferably individually very small so that a large array 11 can be
fabricated on a single chip using suitable VLSI fabrication techniques.

To maintain processor elements 12 small and closely packed, data paths
connecting to each processor element 12 and data paths within a processor element
12 are preferably serial. In the preferred embodiment, all data and instructions are
shifted into and out of each processor element in bit serial fashion; all instruction
and data buses are 1-bit wide; and all arithmetic and logic operations are performed
in a bit serial manner.

Where data streams 16 are serial then processor elements 12 can read data
from any selected data stream 16 as if the data stream were a local register. The bits
from data stream 16 are read sequentially into a register in processor element 12. It
is not necessary to provide separate local buffers for storing data from data streams
16 so that it can be read by processor element 12. This can further reduce the size
and complexity of processor elements 12.

In the preferred embodiment of the invention, while processor element 12 is
executing one instruction, a next instruction is being read into processor element 12
from the currently active instruction stream 14. It is typically not possible to
commence performing an instruction until an entire instruction has been received.
Where processor element 12 operates serially it is, however, possible to operate on
data as it is received since, as noted above, reading data from a serial data stream
16 is not significantly different from reading the same data from a local serial
register. This makes it desirable to shift data streams 16 by one cycle relative to the
instruction streams 14 which contain instructions for operating on the data of data
streams 16. It is convenient to reserve one group of broadcast streams 17 for
instructions and another group of broadcast streams 17 for data.

Each processor element 12 has an ALU (arithmetic and logic unit) 13. In the
preferred embodiment ALU 13 is a simple 2-bit to 1-bit ALU capable of
any 2:1 logic operation, addition, and subtraction. Multiplication can be achieved
through a sequence of operations involving addition and bit shifting. While such a
bit-serial implementation means that each processor element 12 runs approximately
17 times slower (for a 16-bit word length) than it could in a bit-parallel
implementation, the overall result of being able to pack more processor elements 12
into the same silicon area provides a fine-grained parallelism that is a more natural
fit to image related computation problems. Further, with a serial implementation it
is possible to connect each processor element 12 to more instruction streams, data
streams and neighbouring processor elements than would be practical using an
implementation in which instruction streams, data streams and connections to
neighbouring processor elements were made using data paths which carry parallel
data.
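The bit-serial arithmetic described above can be sketched in software. This is only an illustration under stated assumptions, not the circuit of ALU 13: the helper names (to_bits, from_bits, serial_add, shift_add_multiply) and the least-significant-bit-first ordering are chosen for the example.

```python
WORD = 16  # word length used in the text's example

def to_bits(value, width=WORD):
    """Return the bits of value, least significant bit first (assumed order)."""
    return [(value >> i) & 1 for i in range(width)]

def from_bits(bits):
    """Rebuild an unsigned integer from LSB-first bits."""
    return sum(b << i for i, b in enumerate(bits))

def serial_add(a_bits, b_bits):
    """Add two bit streams one bit per cycle: two bits in, one bit out."""
    carry = 0
    out = []
    for a, b in zip(a_bits, b_bits):
        out.append(a ^ b ^ carry)
        carry = (a & b) | (carry & (a ^ b))
    return out  # the final carry is dropped, so the sum wraps modulo 2**WORD

def shift_add_multiply(x, y):
    """Multiply using only serial additions and one-bit shifts."""
    acc = to_bits(0)
    addend = to_bits(x)
    for bit in to_bits(y):
        if bit:
            acc = serial_add(acc, addend)
        addend = [0] + addend[:-1]  # shift the addend left by one bit
    return from_bits(acc)
```

Each call to serial_add consumes one bit of each operand per loop iteration, which is why a 16-bit operation takes on the order of 16 cycles in a serial element.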

A further benefit of using serial shift registers in processor elements 12 is
that the registers 21 can be implemented as dynamic memory rather than static
memory. The serial execution process naturally refreshes the contents of any
registers used by an instruction. Dynamic registers can typically be implemented
with fewer transistors or other circuit elements per bit of storage than can static
registers. This permits further reduction in the area occupied by each processor
element 12. If registers 21 must be refreshed at a rate of a few kHz, at a 10 MHz
instruction rate, it is a small overhead (about 1%) to insert instructions in the
instruction streams which do nothing other than refresh the values in registers 21.
This approach avoids the need for any special refresh logic. As a further refinement,
controller 18 could track the usage of registers in array 11 and insert refresh
instructions into instruction streams 14 on an as-needed basis. After execution of a
refresh instruction a processor element 12 should preferably be in the same state that
it was before execution of the refresh instruction so that refresh instructions can be
inserted at any point in an instruction stream without affecting any processes
running on the processor element.

A conventional memory 34, such as DRAM or SRAM, may be integrated
with array 11 for additional data image storage. This storage could be off chip, or
integrated on chip. As best shown in Figures 1, 2 and 5, to provide input data to
processor array 11 (for example, to provide image data to array 11) and to retrieve
results computed by array 11, there is a set of "edge i/o" registers 24, 25. Registers
24 and 25 are controlled by a DMA (Direct Memory Access) controller 26. DMA
controller 26 can cause values from write registers 24 to be transferred into registers
in processor elements 12 in any selected row of array 11 by way of row select lines
30. Each processor element preferably has a register 32 reserved for such i/o
operations. DMA controller 26 can also retrieve data from registers 32 into registers
25.

A preferred implementation has one register 24 (a write register) for
delivering data to a selected processing element 12 within each column of array 11
and one register 25 (a read register) for retrieving data from a selected processor
element 12 in each column of array 11. To pass data into array 11, DMA controller
26 first places into write registers 24 the data it wants to place into the array. This
data is fetched from any suitable memory accessible to DMA controller 26. For
example, the data may be in a local buffer memory 34, on another device or
network accessed via a communication bus 35 or data being received in an input
video stream 33. Next, DMA controller 26 selects a row of array 11 to which the
data in write registers 24 should be delivered by energizing one of row select lines
30. Then, in each column, data is shifted from write register 24 via i/o line 51 into
the i/o register 32 of the processor element 12 in the selected row in time with a
clock signal.

Data already in the i/o register 32 of the processor element 12 is
simultaneously shifted via i/o line 52 to the read register 25 for that column. This
happens simultaneously for all columns of array 11. In this example, for each
column, write register 24, the i/o register 32 of a processor element 12 in the row
selected by DMA controller 26 and read register 25 can be considered to form a
single 48 bit shift register (16 bits x 3 registers) which is shifted by 16 bits during
the data exchange operation. If data in read registers 25 is of interest then DMA
controller 26 may copy the contents of read registers 25 to a suitable memory
device.
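The 48 bit shift register view of one column can be modelled as follows. This is a minimal sketch: the function name exchange, the choice to shift toward the low-order (read) end, and the zero fill entering the write register are assumptions made for the illustration.

```python
WORD = 16

def exchange(write_reg, io_reg, read_reg, clocks=WORD):
    """Shift the write -> i/o -> read chain of one column by `clocks` bits.

    The three 16-bit registers are treated as one 48-bit shift register.
    Returns the new (write_reg, io_reg, read_reg) contents.
    """
    mask = (1 << WORD) - 1
    chain = (write_reg << (2 * WORD)) | (io_reg << WORD) | read_reg
    for _ in range(clocks):
        chain >>= 1  # one i/o clock pulse moves every bit one place along
    return (chain >> (2 * WORD)) & mask, (chain >> WORD) & mask, chain & mask
```

After 16 clock pulses the value that was in the write register sits in the i/o register, and the value that was in the i/o register sits in the read register, which matches the simultaneous write and read described above.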

A clock signal is used to drive the shift operation. The clock signal is
preferably carried along a clock signal data path 28 which extends from near write
register 24, to the processor element 12 which is being written to and back down to
near read register 25. This ensures that the clock signal experiences very similar
propagation delays to the bits being transmitted. Rambus™ and other fast memory
devices use a similar construction. This makes array 11 fully scaleable (i.e. the
clock speed is not determined by the array size). The operation is completed when
read register 25 is shifted by 16 clock pulses. Array 11 preferably includes a
separate i/o clock 27 for regulating the i/o operations. This permits i/o operations to
be performed asynchronously with, and overlap with, the execution of instructions
by processor elements 12 as long as the instructions being executed do not read or
write to i/o registers 32 while the data exchange operation is occurring. For
example the next image in a video sequence can be fed into array 11 as processor
elements 12 in array 11 process a previous image.

Row select lines 30, i/o lines 51 and 52 and i/o registers 32 constitute means
for selecting one row and means for simultaneously transferring data from each one
of the processor elements in a selected row into a corresponding read register. In
Figure 2, i/o lines 51 and 52 and i/o clock lines 28 are collectively indicated by the
reference numeral 55.

A separate set of edge i/o registers (not shown) could be placed on the left or
right hand edge of array 11 for reading and writing data from selected columns of
array 11. In the alternative to reading and writing from an entire row of processor
elements 12 at the same time, array 11 could be constructed to have a random
access arrangement in which data is written to and/or read from one specific
selected processor element 12 at a time.

Input data for processing by array 11 could come from conventional
memory, or from some other device such as a scanner, a video feed, or a network
interface. Output data from array 11 can be stored in any suitable memory
device or sent to another device such as a display or network interface.

Interconnections of Processor Elements and Redundancy

Figure 3 illustrates the interconnection of processor elements 12 within an
area 11A of array 11. Each square represents one processor element 12. Each
processor element 12 is connected to exchange data with a number of other
processing elements which are located close to it in array 11. Preferably each
processor element 12 is connected to a number N of adjacent processor elements
which are located on either side of the processor element in the same row as the
processor element and also to a number M of other processor elements which are
located on either side of the processor element in the same column as the processor
element. In the embodiment of Figure 3, N=M=9 and each processor element has
connections to 36 other processor elements. Implementations of the invention are
also possible in which processor elements 12 may be connected to a different
number of neighbouring processor elements in each direction.

Illustrated processor element 40 is connected to processor elements 41A
through 41I which are on the same row as processor element 40 and to the right (as
viewed in Fig. 3). Processor element 40 is also connected to processor elements
42A through 42I which are on the same row as processor element 40 and to the left.
Processor element 40 is also connected to processor elements 43A through 43I
which are on the same column as processor element 40 and above processor element
40. Processor element 40 is also connected to processor elements 44A through 44I
which are on the same column as processor element 40 and below processor element
40. The surrounding processor elements to which a processor element is connected
may be called "neighbouring" processor elements. The set of a processor element
12 and all of its neighbouring processor elements may be called a neighbourhood.
In Figure 3, the cruciate neighbourhood 45 of processor element 40 is outlined with
a thick line.
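The shape of a cruciate neighbourhood can be enumerated with a few lines of code. The sketch below is illustrative only; the function name cruciate_neighbours and the (row, column) coordinate convention are assumptions, and array edges are not considered.

```python
def cruciate_neighbours(row, col, n=9, m=9):
    """Return (row, col) coordinates of the cross-shaped set of neighbours.

    n neighbours are taken on either side in the same row and m neighbours
    on either side in the same column; the element itself is excluded.
    """
    same_row = [(row, col + d) for d in range(-n, n + 1) if d != 0]
    same_col = [(row + d, col) for d in range(-m, m + 1) if d != 0]
    return same_row + same_col
```

With N = M = 9 this yields 4 x 9 = 36 neighbours, none of which lies on a diagonal.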

Each connection may be implemented by providing a register 46 (Fig. 2) in
each processor element and circuitry to broadcast the contents of register 46 to each
neighbouring processor element (e.g. for processor element 40, the contents of
register 46 are delivered to each of processor elements 41A through 44I). Register
46 may be termed a "local" broadcast register because it makes a data value
available to other processor elements 12 in a local neighbourhood. The contents of
register 46 can be made available to all neighbouring processor elements. Each
processor element 12 therefore has 36 incoming data connections from neighbouring
processor elements. Preferably, to keep power consumption low, the contents of
register 46 are broadcast only upon request of any one of the neighbouring
processor elements which is connected to receive the contents of register 46. A
neighbouring processor element could request that the contents of register 46 be
broadcast, for example, by briefly applying a signal to the same bus on which the
contents of register 46 can be broadcast. If any one or more neighbouring processor
elements transmit such a data request signal then the circuitry broadcasts the
contents of local register 46 to the other processor elements in the neighbourhood.
The data lines by way of which the contents of local register 46 are broadcast to
neighbouring processor elements and the circuitry in processor element 12 which
drives such data lines constitute means for broadcasting the contents of register 46
to neighbouring processor elements.

In the embodiment of Figure 7, each processor element 12 broadcasts to
neighbouring processor elements 12, or not, depending upon a logic value stored in
a broadcast request generation register 78. Power consumption can be reduced by
setting broadcast request generation register 78 to inhibit broadcasting the contents
of local register 46 except when processing instructions which require results from
other processor elements 12.

Each processor element 12 preferably has input selection logic 48, which
selects a data source for a read operation during any processor cycle. The data
source could be a selected one of the 36 neighbour processor elements or a different
data source, such as an incoming data stream or the like. Preferably each processor
element 12 includes a neighbour access logic unit 49 which selects data presented by
one neighbour in the neighbourhood of the processor element for possible access by
input selection logic 48.

Figures 2 and 6 illustrate one possible implementation of neighbour selection
logic 49. In the illustration of Figure 6, neighbour selection logic 49 connects to
sets of serial data lines 90. One set 90A of serial data lines connects to neighbouring
processor elements in the same column as, and above, each processor element 12.
Other sets 90B, 90C, 90D connect to neighbouring processor elements in other
directions. Each set of data lines comprises a subset of data lines for carrying data
in each direction.

Figure 6 shows data lines 92 which are carrying data downwardly from
above to two neighbour selection logic units 49 of adjacent processor elements 12.
Nine data lines 92 arrive at each neighbour selection logic unit 49. As can be seen in
respect of the lowermost one of the two neighbour selection logic units of Figure 6,
one of the data lines 92A terminates at each neighbour selection logic unit 49. One
data line 92B originates at each neighbour selection logic 49. Data line 92B carries
the value in the "local" register of the associated processor element 12 to
neighbouring processor elements below.

Another set of data lines 92 (not shown) carry local data signals upwardly
from below in the same column. Further sets of data lines 92 (not shown) carry
local data signals from left-to-right and right-to-left on the same row.

A switch 93 contains a register 50. Logic in switch 93 causes one of the 9
incoming data lines 92 to be ignored in response to a value in register 50. Register
50 may, for example, be an 8 bit register. The logic value of each bit may
determine which of two data lines is made available for selection by switch 93.
For example, the first bit may select between first and second ones of data lines 92,
a second bit may select between the second and a third one of data lines 92 and so
on. By inserting an appropriate byte value into register 50, 8 of incoming data lines
92 can be chosen. In response to the value in a data select register 23, switch 93 can
select one of the 8 available incoming data lines for input to processor element 12
via line 94.
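One way to read the pairwise selection performed by register 50 is sketched below. The bit convention (a clear bit keeps line i, a set bit takes line i+1) and the names available_lines, skip_mask and switch_93 are assumptions for the illustration; the text does not fix the polarity of the bits.

```python
def available_lines(lines, reg50):
    """Apply the 8-bit value in register 50 to the 9 incoming lines.

    Bit i chooses between line i (bit clear) and line i + 1 (bit set).
    """
    assert len(lines) == 9
    return [lines[i + ((reg50 >> i) & 1)] for i in range(8)]

def skip_mask(k):
    """Byte value that drops line k (0..8) and keeps the other eight in order."""
    return (0xFF << k) & 0xFF

def switch_93(lines, reg50, data_select):
    """Pick one of the 8 available lines for input to the processor element."""
    return available_lines(lines, reg50)[data_select]
```

Under this convention a byte of the form 11111000 skips the fourth line, and a byte of zero skips the ninth (most remote) line, which corresponds to the healthy case in which the spare connection is unused.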

This architecture provides a number of advantages: it provides direct access
to a good number of local processor elements in the horizontal and vertical
directions of array 11, supporting many typical imaging operations, and it provides
indirect access (through two steps) to an even larger area without incurring the logic
and wiring overhead of a direct connection.

As described below, the architecture of Figure 3 can be used to provide a
simplified mechanism for dealing with any faulty processor elements 12. The use of
cruciate neighbourhoods, as illustrated in Figure 3, allows faulty processor elements
to be bypassed much more simply than could be the case for square
neighbourhoods. In the embodiment of Figure 3, it can be preferable to use only 32
connections to neighbours as active connections and to keep the remaining
connections to the most remote neighbouring processor elements for use as a
redundant back up as described below.

It can be appreciated that this local broadcast mechanism is contention free:
the sending processor element does not need to know which of its neighbour
processor elements requested data from its register 46. No processor element needs
to be able to directly write a value to a specific register outside itself. Further, the
interconnections of processor elements 12 are local in nature. Therefore, the size of
array 11 is not limited by the time it takes to broadcast a signal from one processor
element to all others (as is the case for various forms of prior art arrays in which
individual processor elements have broadcast capability).

Where a processor element 12 is close to an edge of array 11 there may be
fewer than N (or M) neighbouring processor elements on one or more sides. For
such processor elements the data connections which, but for the intervening edge of
array 11, would connect to neighbouring processor elements beyond the edge of
array 11 may be connected to a fixed data value (for example zero) so that when a
processor element close to the edge of array 11 requests data from one of these
locations the result is simply the value zero. Where array 11 has one dimension
much smaller than the other dimension then most, or even all, of processor elements
12 may be close to an edge of array 11.

Alternatively, the data connections from processor elements 12 near the edge
of array 11 could be extended to external connections so that multiple arrays 11
could be combined to create larger arrays. For this latter approach to be
implemented it would be necessary to package array 11 in a manner capable of
providing the necessary data connections. One can appreciate that, when N is 8,
each processor element at the edge of an array 11 would require at least 8 external
connections. If array 11 is, for example, a 160 x 120 array then 4480 data
connections would be required for the peripheral processor elements alone. The
number of physical connections could be reduced by multiplexing several data
connections onto each physical connection.
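The figure of 4480 connections can be reproduced with a short calculation. The counting convention used here is an assumption that happens to match the stated number: every position along each of the four edges needs N outward connections, so a corner element is counted once for each of the two edges it lies on. The function name peripheral_connections is hypothetical.

```python
def peripheral_connections(cols, rows, per_element):
    """External data connections needed by the elements along all four edges."""
    edge_positions = 2 * (cols + rows)  # corner elements count once per edge
    return edge_positions * per_element

total = peripheral_connections(160, 120, 8)  # 2 * (160 + 120) * 8
```

For a 160 x 120 array with N = 8 the result is 560 x 8 = 4480, in agreement with the text.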

In the preferred embodiment of the invention, array 11 is fabricated on a
single chip. Current fabrication techniques are not perfect. If a large array 11 of
processor elements 12 is fabricated on a single chip then it is likely that a few of
processor elements 12 will be defective. The embodiment of the invention shown in
Figure 3 can accommodate such faults by effectively ignoring all processor elements
in a row or column of array 11 in which the faulty processor element 12 resides.

In this embodiment of the invention each processor element 12 has data
connections to a number of neighbouring processor elements in each direction in its
row and column. Figures 7 and 8 illustrate a preferred construction for
accommodating faulty processor elements 12. As shown in Figure 7, each processor
element 12 comprises a defect logic element 70. All of the defect logic elements in
each row in processor array 11 are connected to a row defect register 71. All of the
defect logic elements for each column in processor array 11 are connected to a
column defect register 72. Registers 71 and 72 normally contain a first logic value.
When a processor element 12 at the intersection of a row and column is found to be
defective, a second value is placed in the corresponding row and column defect
registers 71 and 72. In response to the second logic value the defect logic elements
70 cause processor elements 12 in the affected row and column to be ignored.

Figure 8 illustrates a possible construction for a defect logic element 70.
Each defect logic element 70 comprises four sections. Each section handles signals
arriving at the processor element 12 from a different direction along a row or
column. The two sections 70A which handle signals arriving in a row direction are
connected at least to the corresponding column defect register 72. The two sections
70A which handle signals arriving in a column direction are connected at least to
the corresponding row defect register 71. All sections 70A in a defect logic element
70 may be connected to both corresponding defect registers 71 and 72. For example,
all sections may be connected to receive a defect signal that presents the second logic
value if either or both of corresponding defect registers 71 and 72 hold the second
logic value and otherwise presents the first logic value.

Figure 8 illustrates one section 70A. Section 70A has a number of signal
inputs 75 and a number of signal outputs 76. Section 70A comprises a plurality of
two-way multiplexers 74. Each multiplexer 74 connects one of two input signals to
its output. Which signal is connected to the output depends upon the value in the
corresponding defect register.

If defect registers 71 or 72 indicate that either the row or column in which
the defect logic element 70 is located should be ignored, as indicated by a signal at
input 77A, then each section 70A simply connects an input 75 to a corresponding
output 76 so that signals pass through unaltered. If defect registers 71 and 72
indicate that the processor element 12 to which the defect logic element 70
corresponds should be active then section 70A connects an input 77 which receives
a broadcast signal from the processor element 12 to a first output 76A, discards any
signal at an input 75H from a furthest neighbour, and connects inputs 75A through
75G to outputs 76B through 76H respectively.
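The behaviour of one section 70A can be modelled as a function from eight inputs to eight outputs. The sketch below is illustrative: the names section_70a and propagate are hypothetical, signals are represented by plain values rather than serial bit streams, and lines beyond the array edge are assumed to read as zero.

```python
def section_70a(inputs, broadcast, bypass):
    """Model one section 70A with inputs 75A..75H and outputs 76A..76H."""
    if bypass:
        return list(inputs)  # ignored element: signals pass through unaltered
    # Active element: own broadcast goes to 76A, 75A..75G shift to 76B..76H,
    # and the signal at 75H (furthest neighbour) is discarded.
    return [broadcast] + list(inputs[:7])

def propagate(values, defective):
    """Pass signals along a line of elements; return what each one receives."""
    lines = [0] * 8  # nothing beyond the array edge (assumed to read as zero)
    received = []
    for value, bad in zip(values, defective):
        received.append(list(lines))
        lines = section_70a(lines, value, bypass=bad)
    return received

seen = propagate([10, 20, 30, 40], [False, True, False, False])
```

In this example the second element is marked defective, so the third element sees 10 (not 20) as its nearest neighbour: the defective element has been skipped without any change to the healthy elements' logic.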

In an alternative embodiment there are connections to each of the N closest
processor elements in the same column above and below the processor element.
Each processor element actually uses data connections only to N-1 of these
neighbouring processor elements. Each processor element also has connections to
each of the M closest processor elements in the same row to the left and right of the
processor element. The processor element actually uses only M-1 of these
connections. For example, in Figure 3, M=N=9 but each processor element 12
actually uses only 8 connections in each direction. Each processor element 12
comprises a defect register 50 which includes data which identifies a single row and
a single column in each direction to ignore when receiving data from neighbouring
processor elements 12.

In a "healthy neighbourhood" in which there are no defects within the 17 x
17 node region centered on a processor element 40 the defect register is set so that
the most distant cells 41I, 42I, 43I, and 44I are ignored. If a column within the 17
x 17 region needs to be ignored, defect register 50 contains data indicating the
column to be skipped over. Input selection unit 48 then causes the column in
question to be skipped over. Any broadcasts from processor elements 12 in the
skipped column are ignored. Processor elements 12 within a row can be ignored in
the same manner.

A map of which processor elements 12 are defective can be generated either
in production testing, or by way of a POST (Power on Self Test) routine which
executes when array 11 is started up. Software designed for locating defective
processor elements 12 would execute on array 11 and set the defect registers
appropriately. Array 11 is preferably fabricated with enough rows and columns of
processor elements 12 to accommodate a number of defects and still provide an
array having an effective size suitable for the task at hand. The particular rows and
columns of array 11 which should best be disabled to avoid a particular set of
defective processor elements 12 can be determined by applying a suitable algorithm.
For example, U.S. patent No. 4,751,656 assigned to IBM corporation describes one
possible algorithm for choosing the best combination of rows and columns in an
array to disable for the purpose of removing defective array elements. Figure 4
illustrates a portion of array 11 having a defective processor element 60. The row
61 containing the defective processor element has been disabled. Defect register 50
of processor element 40 has been set to ignore row 61 and to allow communication
with processor element 63 in row 62.

The foregoing arrangement permits the accommodation of defects in array
11 in a very simple manner. This arrangement cannot compensate for all possible
distributions of defective processor elements 12. To keep the defect logic simple (so
that it does not impact too significantly on the size of processor elements 12), only a
single row and single column can be deleted on any one side of any processor
element 12. If there are too many defective processor elements 12 within a small
region of array 11 then it may not be possible to remove all of the defective
processor elements 12.

Where it is not possible to remove all of the defective processor elements 12
in an array 11 there may be a rectangular region within array 11 in which it is
possible to compensate for all defective processor elements 12 as described above.
This rectangular region may be used as a smaller array 11. Thus chips which
incorporate processor arrays 11 according to the invention in which there are a
number of defective processor elements can still be used for tasks for which a
smaller working array area will suffice. Thus processor arrays according to the
invention can be made with a higher effective yield than would be the case if
processor arrays 11 which include defective processor elements 12 were suitable
only for scrap.

Since, in the preferred embodiment, processor elements only communicate
directly with other processor elements which are located physically close by, array
11 can be easily scaled. It is not necessary that a processor element 12 in one part of
array 11 be executing an instruction at exactly the same time as another processor
element 12 in a remote part of array 11. All that is necessary is that the information
in broadcast instruction streams 14 and data streams 16 should take about the same
amount of time to reach any given processor element 12 from array program and
data memory 20 (or some other source(s) of instructions and data).

Instruction Set 

The architecture described above can be used in many different contexts.
Processor elements 12 may be implemented in various ways. The following is a
sample of an instruction set that may be implemented by processor elements 12.
The invention is not limited to this instruction set, which balances simplicity (so that
the area of instruction decode logic is not too large), functionality, and efficient
execution of common operations.

In this example, the instructions operate on a 128 register space. Some of the
128 register slots in this register space are associated with physical data storage.
Others refer to read-only data streams and data from neighbouring processor
elements. One possible register mapping is as follows:



TABLE II

Register(s)          Description

r0, r1               General purpose registers "A" and "B"
r2                   local (broadcast to neighbours)
r3                   global (linked to edge i/o registers)
r4                   operand (right shift function, and byte swap capability
                     via operand2)
r5                   operand2 (byte swapped representation of operand)
r6                   row (general reg., but typically stores row address of PE)
r7                   col (general reg., but typically stores column address of
                     PE)
r8                   instrSel (3 bit instruction stream select)
r9                   dataSel (7 bit register address for data or neighbour
                     selection)
r10                  status (6 bits of condition flags and state control flags)
r11                  defect (16 bit defect control register)
r12 (r/o)            DataStr (Data or neighbour stream selected by dataSel)
r13, r14             reserved for future use
r15 (r/o)            -1 (constant value for increment and decrement)
r16 ... r31          general registers (fewer than all of these may actually
                     be used)
r32 ... r63 (r/o)    broadcast data streams (fewer than all of these may
                     actually be used)
r64 ... r127 (r/o)   data from neighbours (fewer than all of these may
                     actually be used)

(r/o) means "read only" in this Table



Providing a register, such as r5, which contains a byte-swapped
representation of an operand is particularly useful for packing and unpacking single
byte values into 16-bit registers (e.g. so that two 8-bit pixels can be easily stored in
a single register).
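The packing use case can be illustrated with a small sketch. The helper names (byte_swap, pack_pixels, unpack_pixels) are hypothetical, and the sketch only shows why a byte-swapped view is convenient; it does not claim to reproduce how registers r4 and r5 are wired.

```python
def byte_swap(value):
    """Swap the two bytes of a 16-bit value."""
    return ((value << 8) | (value >> 8)) & 0xFFFF

def pack_pixels(hi, lo):
    """Store two 8-bit pixels in one 16-bit word."""
    return ((hi & 0xFF) << 8) | (lo & 0xFF)

def unpack_pixels(word):
    """Recover both pixels; the high pixel is read from the swapped view."""
    return byte_swap(word) & 0xFF, word & 0xFF
```

With the swapped representation available as a register, either pixel can be obtained with a single low-byte mask, without a sequence of eight one-bit shifts.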

In this example, every instruction is a 16 bit value which has the structure:
<predicate> <operation>. The execution of the instruction is controlled by a set
of two predicate condition bits. The predicate bits select one of four execution
options based on the current settings of condition flags in the status register 53
(which have been set by an earlier result). Depending upon the value of the
predicate, the instruction will either execute only if the condition flags indicate that
a previous result was less than zero; execute only if the condition flags indicate that
a previous result was equal to zero; execute only if the condition flags indicate that
a previous result was greater than zero; or always execute without regard to the
settings of the condition flags. The use of a
predicate to control execution of instructions permits the efficient execution of short
conditional sequences for which the overhead of switching to a different stream is
not warranted or where the operation of switching to a different stream is itself
conditional.

Cases where conditions need to be combined (e.g. where it is desired that
the instruction should execute if the previous result was either greater than or equal
to zero) can be accommodated by using one extra instruction (e.g. the extra
instruction could test the condition flags; if they indicate "greater than zero", then
the instruction could set the condition flags with a 0 value. The next instruction
could then use a predicate which tests for "equal to zero").
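The predicate mechanism, including the one-extra-instruction trick for combined conditions, can be sketched as follows. The numeric encodings of the four predicate values and the modelling of the condition flags as a single three-way state are assumptions; the text does not specify them.

```python
LT, EQ, GT, ALWAYS = range(4)  # assumed 2-bit predicate encodings

def flags_for(result):
    """Condition state left behind by a result (less, equal, greater)."""
    return LT if result < 0 else EQ if result == 0 else GT

def should_execute(predicate, flags):
    """An instruction runs if its predicate is ALWAYS or matches the flags."""
    return predicate == ALWAYS or predicate == flags

def ge_zero_executes(flags):
    """Emulate 'execute if previous result >= 0' with one extra instruction."""
    # Extra instruction, predicated on GT: set the flags from a 0 value.
    if should_execute(GT, flags):
        flags = flags_for(0)
    # The instruction of interest is then predicated on EQ.
    return should_execute(EQ, flags)
```

The extra instruction folds the "greater than zero" case into the "equal to zero" case, so a single EQ predicate covers both.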

The operation portion of each instruction is 14 bits wide and has one of two
possible structures depending upon whether it specifies an operation to perform
using the contents of two registers or whether it specifies a value to write to a
register. For an operation which operates on the values in two registers the
structure of the <operation> field is:
<lhs> <rhs> <alu-op> <negate> <test>
Where:

<lhs> is a 2 bit value specifying one of three registers (A, B, or "local");
<rhs> is a 7 bit value specifying any register;
<alu-op> is a 3 bit value defining the ALU operation to perform between lhs and
rhs (see Table III);
<negate> is a 1 bit field which, if set, causes the rhs value to be negated prior to
use in the ALU; and,
<test> is a 1 bit field which, if set, causes the result of the ALU operation that is
returned to the lhs register to update the condition flags in the status register.

For an instruction which loads a value into a register the <operation> field
has the structure <mark> <reg> <data>

where: <mark> is a two bit field that indicates that this is a register loading
operation; <reg> is a 4 bit field which specifies any one of the first 16 registers
(r0 ... r15); and <data> is an 8 bit field containing an immediate value to load
into the specified register. This instruction is preferably performed in a single
bit-parallel (latching) operation. This ensures that when the instrSel (instruction
stream select) register is the destination, the correct instruction stream is selected in
time to receive the next instruction on the selected instruction stream. If this control
register load operation were done in bit-serial fashion, the control register would
not be updated in time for the next instruction to be read in bit-serial manner from
the correct stream, in which case there would always need to be a single extra null
operation following an instruction stream switch to wait for the change to come into
effect. As noted elsewhere, for efficiency it is preferable that each processor
element is reading a next instruction while it is executing a current operation.

The status register includes two "mode control" bits for control of left shift
and right shift operations. These mode control bits are useful for efficient
implementation of multiplication. The <rs> (right shift) mode control bit, if set,
causes register r1 to be shifted right by one bit prior to each operation. The least
significant bit of the register is moved to the add-enable flag within the ALU, and
the most significant bit is sign extended (register r1 is used to store one of the
operands for a multiplication operation). If the operation-enable bit of the ALU is
set, the operation is performed, otherwise no operation is performed. If the <rs>
field is 0, the operation-enable bit is always set to 1.

The <ls> (left shift) mode control bit, if set, causes the lhs result to be
shifted left by one bit after the operation has completed, the least significant bit
being set to 0.
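The two <operation> formats described above (the two-register form and the register-load form) can be sketched as bit-field helpers. The text gives the field widths but not their positions within the 16 bit word, so the layout below (predicate in the top two bits, fields packed in the order listed) is an assumption. It is likewise assumed that <mark> is the one <lhs> code not used by A, B or "local"; the names MARK_LOAD, encode_alu, encode_load and decode are hypothetical.

```python
MARK_LOAD = 0b11  # assumed: the lhs code not used by A (0), B (1) or local (2)

def encode_alu(pred, lhs, rhs, alu_op, negate=0, test=0):
    """Pack <predicate><lhs><rhs><alu-op><negate><test> (2+2+7+3+1+1 bits)."""
    assert lhs in (0, 1, 2) and 0 <= rhs < 128 and 0 <= alu_op < 8
    return ((pred << 14) | (lhs << 12) | (rhs << 5)
            | (alu_op << 2) | (negate << 1) | test)

def encode_load(pred, reg, data):
    """Pack <predicate><mark><reg><data> (2+2+4+8 bits)."""
    assert 0 <= reg < 16 and 0 <= data < 256
    return (pred << 14) | (MARK_LOAD << 12) | (reg << 8) | data

def decode(word):
    """Split a 16-bit instruction word back into its fields."""
    pred = word >> 14
    if (word >> 12) & 0b11 == MARK_LOAD:
        return {"pred": pred, "kind": "load",
                "reg": (word >> 8) & 0xF, "data": word & 0xFF}
    return {"pred": pred, "kind": "alu", "lhs": (word >> 12) & 0b11,
            "rhs": (word >> 5) & 0x7F, "alu_op": (word >> 2) & 0b111,
            "negate": (word >> 1) & 1, "test": word & 1}
```

Both formats fill exactly 14 bits after the predicate (2+7+3+1+1 and 2+4+8), which is consistent with the widths stated in the text.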

Table III is an example of operations that may be performed by ALU 13.



TABLE III ALU OP CODES

OP     OPERATION                VALUE RETURNED TO      VALUE RETURNED TO
CODE                            LHS                    RHS

0      no op (null operation)
1      AND                      lhs <- lhs AND rhs'    rhs <- rhs
2      OR                       lhs <- lhs OR rhs'     rhs <- rhs
3      XOR                      lhs <- lhs XOR rhs'    rhs <- rhs
4      add                      lhs <- lhs + rhs'      rhs <- rhs
5      copy rhs' to lhs         lhs <- rhs'            rhs <- rhs
6      copy lhs to rhs          lhs <- lhs             rhs <- lhs
7      swap lhs and rhs'        lhs <- rhs'            rhs <- lhs

if the <negate> bit is set in the instruction, rhs' is the negated rhs, otherwise
rhs' = rhs
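The op codes of Table III can be expressed as a small function that returns the pair of values written back to lhs and rhs. Two points are assumptions rather than statements from the table: negation is modelled as 16-bit two's complement negation, and the null operation is modelled as leaving both registers unchanged. The name alu is hypothetical.

```python
MASK = 0xFFFF  # 16-bit registers

def alu(op, lhs, rhs, negate=False):
    """Return (value returned to lhs, value returned to rhs) per Table III."""
    rhs_p = (-rhs) & MASK if negate else rhs  # assumed two's complement negate
    results = {
        0: (lhs, rhs),                    # no op (assumed: nothing changes)
        1: (lhs & rhs_p, rhs),            # AND
        2: (lhs | rhs_p, rhs),            # OR
        3: (lhs ^ rhs_p, rhs),            # XOR
        4: ((lhs + rhs_p) & MASK, rhs),   # add (subtract when negate is set)
        5: (rhs_p, rhs),                  # copy rhs' to lhs
        6: (lhs, lhs),                    # copy lhs to rhs
        7: (rhs_p, lhs),                  # swap lhs and rhs'
    }
    return results[op]
```

Under these assumptions subtraction needs no op code of its own: it is op code 4 with the negate bit set.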



Example 1 - Conditional Branches

The architecture described above permits various parallel data processing
operations which are not readily feasible with prior architectures. For example, with
an architecture which provides several concurrent instruction streams to processor
elements 12 conditional branches can be performed with enhanced efficiency. If we
assume that initially all processor elements 12 in array 11 are receiving instructions
from a first instruction stream "stream 0", and "InstrSel" is the name of the
instruction select register 22, then a pseudo code program which included a
conditional branch could be constructed as shown in Table IV.



TABLE IV

Stream 0:                          Stream 1:
if (r0 != 0) then InstrSel = 1;    nop
/* Sequence A */                   /* Sequence B */
nop                                InstrSel = 0;



The first instruction in stream 0 causes any processor elements 12 with a
non-zero value in register r0 to switch to stream 1. The two different sequences,
sequence A and sequence B, are then executed in parallel. After both sequences A
and B have been completed, the last instruction in stream 1 causes any processor
elements 12 executing instructions in stream 1 to switch back to stream 0. If one
sequence of instructions is shorter than the other, the processor elements 12
executing that stream can execute null operations (nop) while the processor elements 12
executing the other stream complete the other sequence.
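
The behaviour of Table IV can be mimicked in software. The sketch below is a simplified model under stated assumptions: each processor element is a Python dictionary, the sequences are lists of one-argument functions applied to a hypothetical "value" register, and a stream switch is assumed to take effect in time for the next instruction. The function name run_conditional_branch and the start value are illustrative only.

```python
def run_conditional_branch(r0_values, seq_a, seq_b, start=10):
    """Simulate Table IV for a list of processor elements.

    seq_a / seq_b are lists of functions applied to a hypothetical 'value'
    register; None stands for a null operation. Names are illustrative.
    """
    length = max(len(seq_a), len(seq_b))
    # Pad the shorter sequence with null operations, as described in the text.
    stream0 = ["branch"] + list(seq_a) + [None] * (length - len(seq_a)) + [None]
    stream1 = [None] + list(seq_b) + [None] * (length - len(seq_b)) + ["return"]
    streams = [stream0, stream1]
    pes = [{"r0": r0, "instr_sel": 0, "value": start} for r0 in r0_values]
    for cycle in range(len(stream0)):
        for pe in pes:
            instr = streams[pe["instr_sel"]][cycle]
            if instr is None:
                continue  # null operation
            if instr == "branch":
                if pe["r0"] != 0:
                    pe["instr_sel"] = 1  # if (r0 != 0) then InstrSel = 1
            elif instr == "return":
                pe["instr_sel"] = 0  # InstrSel = 0
            else:
                pe["value"] = instr(pe["value"])
    return pes, len(stream0)
```

In this model the total cycle count is the length of the longer sequence plus two, one cycle for the branch and one for the return to stream 0.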

Example 2 - Table Lookup 

With an array 11 according to the invention which has both multiple instruction
streams 14 and multiple data streams 16 a table lookup operation can be
executed in a reduced number of cycles. Further, each processor element can
retrieve a value from the table while performing fewer operations. This can result in
lower power consumption by array 11. In one approach a lookup table can be
divided into a number of approximately equal-sized blocks. Preferably the table is
divided into the same number of blocks as there are available data streams 16. Each
block might correspond, for example, to a data value within a certain range. Each
processor element 12 in array 11 has a register containing a data value to be looked
up in the table. Each processor element 12 performs instructions which cause it to
inspect the data value and to identify from the data value one of the blocks
corresponding to the data value. The processor elements 12 switch to monitoring a data
stream on which the selected block will be broadcast. The blocks are then broadcast
in parallel to the processor elements 12 on the multiple data streams 16. Each
processor element retains a value from the table which corresponds to the data value
being looked up.

As a simple example, array 11 has 4 instruction streams 14 and 4 data
streams 16, and each processor element 12 has in its register r0 a value which is an
index of the desired table element. The index is in the range 0 to 31 (the first
element of the table has an index of 0, as is common practice in programming
languages such as C). Data in the selected data stream is referred to as "DataStr".
The looked up value is stored in register r1. A lookup in a table having 32 values
could be implemented as shown in Tables V and VI.



TABLE V

       Instruction Streams
Cycle  0                               1                               2                               3
1      dataSel <- r0
2      InstrSel <- r0 >> 2
3      r0 <- r0 >> 4                   r0 <- r0 >> 4                   r0 <- r0 >> 4                   r0 <- r0 >> 4
4      if (r0 = 0) then r1 <- DataStr
5      decrement r0                    if (r0 = 0) then r1 <- DataStr
6                                      decrement r0                    if (r0 = 0) then r1 <- DataStr
7                                                                      decrement r0                    if (r0 = 0) then r1 <- DataStr
8      if (r0 = 0) then r1 <- DataStr                                                                  decrement r0
9                                      if (r0 = 0) then r1 <- DataStr
10                                                                     if (r0 = 0) then r1 <- DataStr
11                                                                                                     if (r0 = 0) then r1 <- DataStr
12                                     InstrSel <- 0                   InstrSel <- 0                   InstrSel <- 0

TABLE VI

       Data Streams
Cycle  0         1         2         3
1
2
3
4      Tbl[0]    Tbl[1]    Tbl[2]    Tbl[3]
5      Tbl[4]    Tbl[5]    Tbl[6]    Tbl[7]
6      Tbl[8]    Tbl[9]    Tbl[10]   Tbl[11]
7      Tbl[12]   Tbl[13]   Tbl[14]   Tbl[15]
8      Tbl[16]   Tbl[17]   Tbl[18]   Tbl[19]
9      Tbl[20]   Tbl[21]   Tbl[22]   Tbl[23]
10     Tbl[24]   Tbl[25]   Tbl[26]   Tbl[27]
11     Tbl[28]   Tbl[29]   Tbl[30]   Tbl[31]
12

All processor elements 12 are initially executing instructions from instruction
stream 0. "dataSel" refers to the data stream selection register. The table values
(indicated using the indexing notation "Tbl[index]") are sent via the multiple data
streams 16. Blank entries in the tables are intended to represent null operations and
data values.

The first instruction places the index into the data source register dataSel.
Because the data source is only a 2-bit value, the effect is that bits 0 and 1 of the
index are used to set the data source register. The next instruction sets the
instruction stream. As for the data source register, the instruction source register is
only 2 bits. So the effect of the second instruction is to place bits 2 and 3 of the
index into the instruction source register (r0>>2 indicates shifting r0 right by two
places).

The final preparatory step, which is performed by every processor element,
is to shift r0 by 4 places, thereby leaving r0 with the remaining bits of the index (in
this case only bit 4 is used as the index is a 5-bit value). This, combined with the
instruction stream selection, determines on which row processor element 12
accesses the desired table value and stores the table value in register r1.

After processor elements 12 have executed these preparatory steps the
table is broadcast, as shown in Table VI, in synchrony with the instruction cycles.
Because the size of the table is twice the product of instruction streams and data
streams (4 x 4 = 16, while the table is 32 elements), the table is effectively divided
into two 16-element blocks, and these two main blocks are further divided into 4
sub-blocks (one per data stream). While the table is broadcast, those processor
elements 12 which are executing each instruction stream wait for one of two specific
cycles in which the appropriate table elements are being broadcast. For example, as
shown in cycles 4 and 8, the processor elements 12 which are executing the
instructions of instruction stream 0 select data values from either the first set of
table values being broadcast or the fifth set of table values being broadcast, depending
on the value of the fifth bit of the index initially stored in register r0. When r0 is
zero, the processor element 12 accesses the data stream selected in cycle 1 and
places the data value into register r1. The final instruction at cycle 12 returns
all processor elements to instruction stream 0.
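
The schedule of Tables V and VI can be checked with a short simulation. The sketch below is a software model only: processor elements are Python dictionaries, the function name broadcast_lookup is hypothetical, and a decrement of zero is modelled as a negative number, standing in for whatever non-zero value the hardware register would hold.

```python
def broadcast_lookup(table, indices):
    """Simulate the 12-cycle lookup of Tables V and VI for a 32-entry table.

    Illustrative model; the dictionary keys mirror the register names in the
    text but the code itself is an assumption, not the described hardware.
    """
    assert len(table) == 32
    pes = [{"r0": i, "r1": None, "data_sel": 0, "instr_sel": 0} for i in indices]
    for pe in pes:
        pe["data_sel"] = pe["r0"] & 3          # cycle 1: bits 0 and 1 of the index
        pe["instr_sel"] = (pe["r0"] >> 2) & 3  # cycle 2: bits 2 and 3 of the index
        pe["r0"] >>= 4                         # cycle 3: leaves bit 4 of the index
    for cycle in range(4, 12):
        # Table VI: four consecutive table values are broadcast per cycle.
        first = 4 * (cycle - 4)
        data_streams = table[first:first + 4]
        for pe in pes:
            s = pe["instr_sel"]
            if cycle in (4 + s, 8 + s):        # "if (r0 = 0) then r1 <- DataStr"
                if pe["r0"] == 0:
                    pe["r1"] = data_streams[pe["data_sel"]]
            elif cycle == 5 + s:               # "decrement r0"
                pe["r0"] -= 1
            # every other slot in this stream is a null operation
    for pe in pes:
        pe["instr_sel"] = 0                    # cycle 12: back to stream 0
    return [pe["r1"] for pe in pes]
```

An element with index i reads data stream i mod 4 in cycle 4 + (i div 4), which is the cycle in which Table VI broadcasts Tbl[i].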

The overall result is that the time taken to apply the table look-up has been
reduced in proportion to the number of data streams (excluding set up and clean up
time which is a small overhead for large tables). This is a significant improvement
over conventional SIMD array architectures. It is also an improvement over serial
architectures, such as architectures using one or more RISC, CISC or VLIW
processors. This method reduces the amount of data that needs to be fetched from
the data memory of array controller 18 because each element of the array only needs
to be fetched a single time. It is straightforward to extend this to larger tables, and
to use more or fewer instruction streams or more or fewer data streams.

Processor elements 12 preferably comprise circuitry which uses very little
power when the processor element is executing a null operation ("NOP"). It can be
seen from inspecting Table V that each processor element is idle during
approximately one half of the processing cycles required for the table lookup
operation. In other architectures table lookup operations require much higher
processor utilization with a commensurate increase in energy consumption.

Example 3 - Matrix Operations 

Certain matrix multiply operations (such as used for the discrete cosine
transform commonly used in image compression) require that different processor
elements perform calculations using different matrix coefficients based on their
position within a local matrix. In these cases the processor element needs to have its
own row and column position stored in its registers for use in choosing the
appropriate stream. Appropriate coefficients can be effectively delivered to the
different processor elements 12 by way of the multiple data streams provided in
arrays according to preferred embodiments of the invention.

Example 4 - Miscellaneous Image Processing Operations

Image processing operations whose behaviour changes near an image
boundary, as is common for area operations such as filtering, can easily switch
processor elements 12 responsible for processing boundary pixels to a different
instruction stream to implement their different behaviour. Different instruction
streams can also be used to make processor elements 12 responsible for processing
even rows of pixels in interlacing or de-interlacing operations perform differently
from those processor elements 12 responsible for processing odd rows of pixels.
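
As a rough illustration of position-dependent stream selection, the sketch below computes a stream number from the row and column of a processor element. The function names and the particular stream numbers (0 for interior or even rows, 1 for boundary or odd rows) are assumptions chosen for the example, not assignments taken from the description.

```python
def boundary_stream(row: int, col: int, rows: int, cols: int) -> int:
    """Return 1 for elements handling boundary pixels, 0 for interior ones.

    The stream numbers are hypothetical; any two distinct streams would do.
    """
    if row in (0, rows - 1) or col in (0, cols - 1):
        return 1
    return 0


def interlace_stream(row: int) -> int:
    """Return 0 for even rows and 1 for odd rows (assumed assignment)."""
    return row & 1
```

Each processor element would load the returned number into its instruction select register before the position-dependent part of the operation begins.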

Example 5 - Blending Two Images

This simple example assumes that there are two 160 pixel x 120 pixel grey
level images. Pixel values for a first one of the images are stored in register r16 of
each processor element in a 160 x 120 array of processor elements 12. Pixel values
for a second image are stored in register r17 of each processor element. The
objective is to place a blended result given by the formula (r16 + r17)/2 into a
register r18. r0 is used as a temporary register for the operation. A sequence of
instructions for execution on each of processor elements 12 which can accomplish
this result is shown in Table VII.



TABLE VII

INSTRUCTION            COMMENTS
r0 = r16;
r0 = r0 + r17;
operand = r0;          put sum of r16 and r17 in an operand register
                       which supports the right shift function
status = RSHIFT_ON;    turn on right shift mode to divide by 2
status = 0;            turn off right shift mode
r0 = operand;          get result
r18 = r0;              put result in r18



This operation can be completed in seven instruction cycles for any size of array.
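
The arithmetic of Table VII can be written out per pixel as follows. This is a sequential sketch, assuming Python integers in place of the registers (so the intermediate sum never overflows) and using hypothetical function names; in the array every processor element would perform these steps at the same time.

```python
def blend_pixel(r16: int, r17: int) -> int:
    """Follow the register moves of Table VII for one processor element."""
    r0 = r16
    r0 = r0 + r17
    operand = r0
    operand >>= 1  # right shift mode divides the sum by 2
    r0 = operand
    r18 = r0
    return r18


def blend_images(image_a, image_b):
    """Blend two equally sized grey level images given as lists of rows."""
    return [
        [blend_pixel(a, b) for a, b in zip(row_a, row_b)]
        for row_a, row_b in zip(image_a, image_b)
    ]
```

Because the division is a right shift, odd sums are rounded down, so blending pixel values 1 and 2 gives 1.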



Example 6 - Column Addition

This example begins with each processor element in an array 11 having 120
rows holding a pixel value for a grey scale image in a register r16. The objective is
to add up the pixel values in each column of array 11 and to place the result in a
register r2 of processor elements 12 in row 0 of the array. A sequence of
instructions for execution on each of processor elements 12 which can accomplish
this objective is shown in Table VIII. In the syntax of this example, "on()"
identifies specific instruction streams which execute the given instruction. Other
instruction streams contain null operations. "par" followed by a block containing
several different instructions means that the instructions in the block are each
provided in a separate instruction stream and execute in parallel. The first
instruction in the block is delivered in stream 0, the second instruction is delivered
in stream 1, and so on.



TABLE VIII

INSTRUCTION                             COMMENTS
local = r16;                            Make contents of register r16 available to neighbours
r0 = row;                               Get row number of the processor element
instrSel = row;                         Select instruction stream based on the lower 3 bits of
                                        the row number
on (0,2,4,6) local = local + down[1];   Instruction streams for processor elements on "even"
                                        rows cause the processor elements to add the value from
                                        the processor element below them in their column to the
                                        value in their "local" register
on (0,4) local = local + down[2];       Instruction streams for processor elements in every
                                        fourth row cause the processor elements to add the value
                                        from the processor element two below them in their
                                        column to the value in their "local" register
on (0) local = local + down[4];         Processor elements running instructions in stream 0 add
                                        the value from the processor element four below them in
                                        their column to the value in their "local" register.
on (0) repeat(rows/8 - 1)
{
  local = local + down[8];              Every processor element running instructions in stream 0
                                        adds the value from the processor element eight below in
                                        its column to the value in its "local" register.
}
par {on (0) local = local + down[8];    Switch unused PEs back to original instruction stream
     on (1 to 7) instrSel = 0;}

This example illustrates how the cruciate neighbourhood structure described
above can speed up operations involving summation over rows or columns of an
array. An array with only nearest neighbour access would require at least 120 cycles
to complete this operation. This example shows that the operation can be performed
in only 21 cycles in a system according to the invention. Processor elements which
are not required to perform a calculation in any given cycle preferably execute null
operations. This does not change the result computed, but significantly reduces
power consumption.
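
The first three addition steps of Table VIII and the cycle count quoted above can be checked with a short model. The sketch below is an illustration under stated assumptions: it simulates only the down[1], down[2] and down[4] steps for a single column, with every element reading its neighbour's value from before the step, and it counts rather than simulates the remaining down[8] instructions. The function names are hypothetical.

```python
def tree_reduce_blocks(column):
    """Apply the down[1], down[2] and down[4] steps to one column.

    Returns the 'local' values afterwards; every eighth row then holds the
    sum of its block of eight rows. Illustrative model only.
    """
    rows = len(column)
    assert rows % 8 == 0
    local = list(column)
    for stride in (1, 2, 4):
        snapshot = list(local)  # values visible to neighbours before the step
        for row in range(0, rows, 2 * stride):
            local[row] = snapshot[row] + snapshot[row + stride]
    return local


def cycle_count(rows: int) -> int:
    """Count the instructions of Table VIII for a column of `rows` elements."""
    setup = 3                      # local = r16; r0 = row; instrSel = row
    tree = 3                       # down[1], down[2], down[4]
    chain = (rows // 8 - 1) + 1    # repeat block plus the final par instruction
    return setup + tree + chain
```

For a 120-row array cycle_count returns 21, matching the figure given in the text, and after the three tree steps the values in rows 0, 8, 16 and so on add up to the full column sum.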

Example 7 - MPEG Pattern Match

An MPEG macro-block is a 16 x 16 pixel region of an image. This is the
size of the regions used for motion estimation. A typical method for comparing one
region with another is called "sum of absolute difference". The pixels in the two
regions are compared by taking the absolute difference between corresponding
pixels, and then summing this up to produce a match score. A low value indicates a
better match than a high value. The following code illustrates the comparison
process for an 8 x 8 region (it is common to first subsample the image by a factor of
2 for a faster initial search, and then refine that search).

The most time consuming portion of this task is the summation across all the
cells in the 8 x 8 region. The following example illustrates one way to perform
motion estimation in a system according to the invention. This example can achieve
a high level of utilization of processor elements 12 and completes one motion
estimation cycle in just 8 instruction cycles.

An 8x9 block of processor elements 12 is used to process each 8x8 region of
an image. One row of 8 processor elements 12 is used to perform final post-processing
work of row summation and minima test. The resulting motion
estimation vector ends up in the top left processor element of the block. Table IX
shows how 8 instruction streams, numbered 0 through 7, are allocated to the
processor elements in the 8 x 9 block.

The pixel values for the reference block are in register r1 of processor elements 12.
The pixel values for the input block that is being tested for similarity with the
reference block are stored in registers r0. Each processor element is executing
instructions from the instruction stream identified in Table IX.



Table X shows a sequence of 8 instructions executed by each of the first four 
instruction streams associated with the 8 x 8 block. 



TABLE X

INSTRUCTION                           COMMENTS
on (0..3) local = r0;                 make input image pixel available to neighbours
par
{
  on (1..3) local = up[1];            shift input image down by one pixel
  on (0) local = up[2];               processor elements in row 1 executing stream 0 must
                                      skip over row of post-processing processor elements
                                      above them
}
on (0..3)
{
  r0 = local;                         save the new input pixel
  local = local - r1 {?};             compare against reference pixel and test
  {<0?} local = -local;               absolute value
}
on (0..2) local = local + down[1];    Now sum up the columns in the 8x8 region
on (0,1) local = local + down[2];
on (0) local = local + down[4];       column sum now ready for pickup by post-processing row



Table XI lists a sequence of 8 instructions executed by processor elements in
the added post-processing row which are executing the last four instruction streams.



TABLE XI

INSTRUCTION                           COMMENTS
on (4..7) local = down[1];            Fetch result of last cycle from top of 8x8 block
on (4..6) local = local + right[1];   Continue summation along row
on (4..5) local = local + right[2];
on (4)
{
  local = local + right[4];           Finished summation
  r0 = min;                           Fetch current minimum
  r0 = r0 - local {?};                Compare new sum against min
  {>0?} min = local;                  If better, store new min
  {>0?} r1 = shiftPosition;           and store associated motion vector
}
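
For reference, the quantity computed by Tables X and XI can be stated as a short sequential routine. This is a plain software sketch, not the parallel instruction sequence above: the function names, the use of nested lists for blocks and the dictionary of candidate shift positions are all assumptions made for illustration. A candidate replaces the current minimum only when its score is strictly lower, mirroring the {>0?} test in Table XI.

```python
def sum_of_absolute_differences(block_a, block_b) -> int:
    """Sum the absolute differences between corresponding pixels."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def best_match(reference, candidates):
    """Return (shift_position, score) of the candidate with the lowest score.

    `candidates` maps a hypothetical shift position to an input block.
    """
    best_shift, best_score = None, None
    for shift, block in candidates.items():
        score = sum_of_absolute_differences(reference, block)
        if best_score is None or score < best_score:
            best_shift, best_score = shift, score
    return best_shift, best_score
```

A sequential processor evaluates the 64 differences of an 8 x 8 region one after another, which is the summation cost that the parallel scheme of Tables X and XI is intended to reduce.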





It can be appreciated that in the currently preferred embodiment of the
invention, which is described above, there are enough registers 21 within each
processor element 12 to hold values for 16 8-bit pixels, with additional free registers
to perform useful work on these image values. The result is that an entire 640x480
8-bit image can be held in array 11. Alternatively, significant portions of multiple
images can be held in the array at the same time (e.g. four 320x240 images or
sixteen 160x120 images). For applications such as pattern matching, this means that
the reference image can be kept in the array at all times, rather than needing to
repeatedly fetch it from memory. This results in significant image processing
performance improvements because it substantially reduces the overhead of fetching
and storing image data that is inherent in a serial processor architecture (e.g. RISC,
CISC, or VLIW).
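
The capacity figures above follow from simple arithmetic, sketched below. The 160 x 120 array size is taken from Example 5 and is assumed here to be the array being referred to; the function name images_that_fit and its parameter names are hypothetical.

```python
def images_that_fit(width: int, height: int, array_cols: int = 160,
                    array_rows: int = 120, pixels_per_pe: int = 16) -> int:
    """Return how many whole 8-bit images of the given size the array can hold.

    Assumes a 160 x 120 array (as in Example 5) with 16 pixels per element.
    """
    capacity = array_cols * array_rows * pixels_per_pe  # total pixels stored
    return capacity // (width * height)
```

With these assumptions the array stores 160 x 120 x 16 = 307,200 pixels, which is exactly one 640x480 image, four 320x240 images or sixteen 160x120 images.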

As will be apparent to those skilled in the art in the light of the foregoing
disclosure, many alterations and modifications are possible in the practice of this
invention without departing from the spirit or scope thereof. For example, while the
data paths within array 11 have been described as serial data paths the architecture
of the invention could also be used with parallel data paths. Data paths 16 and
instruction paths 14 could be interchangeable.

While the logical values of flags or bits have been referred to herein as being
"1" or "0" to represent logical conditions of TRUE and FALSE respectively, any
distinct signals could be used to represent these logic levels.

While each instruction stream and each data stream may be carried on a
separate bus, it would be possible in some embodiments of the invention to
multiplex several data and/or instruction streams on a single bus.

For clarity, certain elements, such as power supplies, power connections,
some clock lines and the like, have been omitted from the above drawings and
description. Such elements are known to those skilled in the art and are therefore
not described herein. For sake of illustration only, power connections may be
provided to processor elements 12 by way of a power bus extending parallel to row
select lines 30.

Accordingly, the scope of the invention is to be construed in accordance with
the substance defined by the following claims.

WHAT IS CLAIMED IS:

1. A processor array comprising a plurality of interconnected processor
elements, a plurality of instruction buses connected to each of the processor
elements, at least one data bus connected to each of the processor elements
and an instruction selection switch associated with each of the processor
elements, each processor element connected to execute instructions from a
one of the plurality of instruction buses selected by its instruction selection
switch.

2. The processor array of claim 1 wherein each of the processing elements
comprises an instruction bus selection register and the instruction selection
switch is constructed to select a one of the plurality of instruction buses
corresponding to a data value in the instruction bus selection register.

3. The processor array of claim 1 comprising a plurality of data buses
connected to each of the processor elements.

4. The processor array of claim 3 comprising a data selection switch associated
with each of the processor elements, each processor element connected to
receive data from a one of the plurality of data buses selected by its data
selection switch.

5. The processor array of claim 4 wherein each of the processing elements
comprises a data bus selection register and the data selection switch is
constructed to select a one of the plurality of data buses corresponding to a
data value in the data bus selection register.

6. The processor array of claim 1 wherein each of the processor elements is
connected to send data to other processor elements in a cruciate
neighbourhood.

7. The processor array of claim 1 wherein the processor elements are arranged
in a plurality of rows and a plurality of columns and each of the processor
elements has direct data connections to at least one other processor element
in the same row as the processor element and at least one other processor
element in the same column as the processor element.

8. The processor array of claim 7 wherein each processor element has direct
data connections to a plurality of neighbouring processor elements on each
side of the processor element in the same row as the processor element and a
plurality of neighbouring processor elements on each side of the processor
element in the same column as the processor element.

9. The processor array of claim 6 wherein each of the processor elements
comprises a local register and the processor element is connected to
broadcast data in the local register simultaneously to other processor
elements in the cruciate neighbourhood.

10. The processor array of claim 9 wherein each of the processor elements
comprises a circuit connected to receive a data request signal indicating that
at least one other processor element in the neighbourhood has requested that
the contents of the register be broadcast and the circuit is adapted to
broadcast the contents of the register only if a data request signal has been
received.

11. The processor array of claim 9 comprising a broadcast request generation
register connected to each of the processor elements, wherein broadcasting
the contents of the register is inhibited when the broadcast request generation
register contains a first logic value.

12. The processor array of claim 6 wherein each of the processor elements
comprises a register and selection logic, the selection logic configured to
receive data from a particular one of the other processor elements in the
cruciate neighbourhood as determined by a value in the register.

13. The processor array of claim 6 wherein the cruciate neighbourhoods each
comprise four arms radiating from a processor element and each arm
comprises at least two processor elements.

14. The processor array of any one of claims 1-13 wherein a ratio of the number
of processor elements in the processor array to the number of instruction
buses in the processor array is greater than 100:1.

15. The processor array of any one of claims 1-13 wherein a ratio of the number
of processor elements in the processor array to the number of instruction
buses in the processor array is greater than 1000:1.

16. The processor array of any one of claims 1-15 wherein the data buses 
comprise serial data buses. 

17. The processor array of any one of claims 1-16 wherein the instruction buses
comprise serial instruction buses.

18. The processor array of claim 1 comprising a plurality of data streams
connected to each of the processor elements.

19. The processor array of any one of claims 1 to 18 packaged on a single
integrated circuit.

20. The processor array of any one of claims 1 to 19 wherein the processor array
comprises at least 10,000 of the processor elements.

21. The processor array of claim 1 wherein each of the processor elements is
located at a node of a grid comprising a plurality of rows and a plurality of
columns.

22. The processor array of any one of claims 1 to 21 wherein each of the
processor elements comprises a plurality of registers of a type which require
dynamic refreshing.

23. A processor array comprising a plurality of interconnected processor
elements, each of the processor elements logically arranged at an intersection
of a row and a column in a grid comprising a plurality of rows and a
plurality of columns, each of the processor elements connected to transmit
data to other processor elements in a neighbourhood comprising a plurality
of neighbouring processor elements, the plurality of neighbouring processor
elements comprising a number N > 1 of processor elements in the column on
either side of the processor element and a number M > 1 of processor
elements in the row on either side of the processor element.

24. The processor array of claim 23 wherein N ≥ 4 and M ≥ 4.

25. The processor array of claim 23 wherein M = N = 2^n, wherein n is an
integer and n ≥ 1.

26. The processor array of claim 25 wherein N ≥ 8 and M ≥ 8.

27. The processor array of claim 23 wherein the neighbourhood comprises a first
number of neighbouring processor elements in the column on a first side of
the processor element and a second number of processor elements in the
column on a second side of the processor element.

28. The processor array of claim 23 wherein each of the processor elements
comprises a register and selection logic, the selection logic configured to
receive data from a particular one of the other processor elements in the
neighbourhood as determined by the value in the register.

29. The processor array of claim 23 wherein each of the processor elements
comprises a plurality of registers of a type which require dynamic
refreshing.

30. The processor array of claim 23 wherein one or more instruction buses are
connected to deliver a plurality of instruction streams from an instruction
source to each of the processor elements, one or more data buses are
connected to deliver at least one data stream from a data source to each of
the processor elements and one or more clock buses are connected to deliver
a clock signal from a clock to each of the processor elements, wherein, for
each of the processor elements, propagation times to the processor element
from the data source on the one or more data buses, from the instruction
source on the one or more instruction buses and from the clock on the one or
more clock buses are substantially the same.

31. The processor array of claim 23 wherein each of the processor elements
comprises an i/o register and the array comprises a set of read registers, the
read registers comprising one read register for each of the columns, a first
i/o data line connecting each i/o register to a corresponding read register;
and, row select logic connected to select all of the processor elements in one
of the rows, wherein, when one of the rows is selected, data from i/o
registers of processor elements in the selected row is written to the
corresponding read registers by way of the first i/o data lines.

32. The processor array of claim 31 comprising an output system clock and
circuitry for moving data from the i/o registers to the read registers in time
with a clock signal generated by the output system clock.

33. The processor array of claim 32 wherein the processor array comprises a
processor timing clock, which is separate from the output system clock, the
processor timing clock providing a clock signal to each of the processor
elements.

34. The processor array of claim 32 comprising a plurality of write registers, the
write registers comprising one write register for each of the columns, and a
second i/o data line connecting each i/o register to a corresponding write
register.


35. The processor array of claim 34 wherein the first and second i/o data lines
are serial data lines and the processor array is configured to bitwise shift a
value from a write register to the i/o register of a corresponding processor
element in a selected row and to simultaneously bitwise shift a value from
the i/o register of the corresponding processor element to the corresponding
read register.

36. The processor array of claim 23 wherein each of the processor elements
comprises means for simultaneously broadcasting the contents of a local
register to all other processor elements in the neighbourhood.

37. The processor array of claim 23 comprising a plurality of read registers, one
read register corresponding to each of the columns, means for selecting one
of the rows and means for simultaneously transferring data from each one of
the processor elements in a selected row into a corresponding read register.

38. A method for operating a processor array comprising a plurality of processor
elements, each of the processor elements comprising a plurality of registers,
each of the plurality of registers in each of the processor elements
comprising registers which require dynamic refreshing at a refresh
frequency, the method comprising:

a) providing one or more streams of instructions to each of the
processor elements for execution by the processor elements; and,

b) periodically inserting into the one or more instruction streams register
refresh instructions, the register refresh instructions causing the
processor elements to rewrite data values in the registers.

39. A method for operating a processor array having a plurality of
interconnected processor elements, the method comprising:

a) providing an array of processor elements, each of the processor
elements logically arranged at an intersection of a row and a column
in a grid comprising a plurality of rows and a plurality of columns,
each of the processor elements connected to transmit data to a
plurality of neighbouring processor elements, the plurality of
neighbouring processor elements comprising a number N of
processor elements in the column on either side of the processor
element and a number M of processor elements in the row on either
side of the processor element;

b) determining when one or more of the processor elements is defective;
and,

c) for each defective one of the processor elements, configuring the
array to ignore the row and column containing the defective one of
the processor elements.

40. A method for implementing a table lookup operation in a processor array,
the method comprising:

a) providing a processor array comprising a plurality of processor
elements;

b) providing multiple data streams to each processor element;

c) providing a lookup table comprising several parts, each part
corresponding to a range of values, each of the parts comprising one
or more table values;

d) simultaneously transmitting the several parts of the lookup table on
the multiple data streams;

e) at each processor element selecting a data stream to access as a
function of a data value in the processor element; and,

f) at each processor element retrieving from the selected data stream a
table value corresponding to the data value of the processor element.